Intel/Qwen3-8B-int4-AutoRound-inc

Model Details

This is an INT4 quantization of Qwen/Qwen3-8B, generated by intel/auto-round with symmetric quantization and a group size of 128.
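
To make "symmetric quantization" and "group size 128" concrete, here is a minimal sketch of symmetric, group-wise round-to-nearest quantization in plain Python. It is an illustration under simplifying assumptions, not AutoRound's algorithm: AutoRound tunes the rounding with signed gradient descent (see the paper in the Cite section), and real INT4 kernels pack values and may use a different integer convention. All function names here are hypothetical.

```python
import math

def quantize_group_sym(weights, bits=4):
    """Quantize one group with a single shared scale and no zero point."""
    qmax = 2 ** (bits - 1) - 1      # 7 for int4
    qmin = -(2 ** (bits - 1))       # -8 for int4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Plain round-to-nearest, clamped to the signed int4 range.
    q = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return q, scale

def quantize_row(row, group_size=128, bits=4):
    """Split a weight row into groups of `group_size` and quantize each."""
    return [
        quantize_group_sym(row[start:start + group_size], bits)
        for start in range(0, len(row), group_size)
    ]

def dequantize_row(groups):
    """Rebuild approximate float weights from (ints, scale) pairs."""
    return [v * scale for q, scale in groups for v in q]

# Toy data: 256 made-up weights, so two groups of 128.
row = [0.1 * math.sin(i) for i in range(256)]
groups = quantize_row(row)
restored = dequantize_row(groups)
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(len(groups), "groups, max abs error:", max_err)
```

In this sketch each group stores 128 four-bit integers plus one scale, so the rounding error of any weight is at most half of its group's scale.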

How To Use

INT4 Inference (CPU/CUDA/Intel GPU)

from transformers import AutoModelForCausalLM, AutoTokenizer
quantized_model_dir = "Intel/Qwen3-8B-int4-AutoRound-inc"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir)
model = AutoModelForCausalLM.from_pretrained(
    quantized_model_dir,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,  # change this to align with the official usage
    do_sample=False  # greedy decoding; change this to align with the official usage
)
# keep only the newly generated tokens (drop the prompt)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

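# parse the thinking content
try:
    # reverse search for the last occurrence of token id 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    # no </think> in the output (for example, generation was cut off by max_new_tokens)
    index = 0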

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
##INT4:
# thinking content: <think>
# Okay, the user is asking for a short introduction to large language models. Let me start by recalling what I know about them. Large language models are a type of AI that can process and generate human-like text. They're based on deep learning, right? I should mention their training process, using massive datasets. Maybe explain how they work with neural networks, like transformer architectures. Also, their applications are important—like answering questions, writing, coding. But I need to keep it concise. Wait, the user wants a short intro, so I shouldn't go into too much detail. Let me structure it: start with the definition, mention the training data, the technology (transformers), and then the applications. Also, maybe touch on their capabilities, like understanding context and generating coherent text. Oh, and maybe note that they're used in various fields. I should avoid jargon but still be accurate. Let me check if I'm missing anything. Oh, maybe mention that they're pre-trained on a lot of text, which allows them to handle multiple tasks. Yeah, that's a key point. Alright, time to put it all together in a clear, concise way.
# </think>
# content: Large language models (LLMs) are advanced AI systems trained on vast amounts of text data to understand and generate human-like language. Built using deep learning techniques, particularly transformer architectures, they process and analyze patterns in text to perform tasks like answering questions, writing stories, coding, and more. These models leverage extensive training data to grasp context, syntax, and semantics, enabling them to engage in complex conversations and adapt to diverse applications across fields like education, healthcare, and technology. Their ability to generate coherent, context-aware responses makes them a cornerstone of modern natural language processing.

##BF16:
# thinking content: <think>
# Okay, the user wants a short introduction to large language models. Let me start by defining what they are. They're AI systems trained on vast amounts of text data, right? I should mention their ability to understand and generate human-like text. Maybe include examples like GPT or BERT. Also, highlight their applications in tasks like answering questions, writing, coding, and more. Keep it concise but cover the key points: training data, capabilities, and use cases. Avoid technical jargon to keep it accessible. Let me check if I need to mention the scale of the models, like the number of parameters. That's important for context. Oh, and maybe touch on how they process different languages. Wait, the user said "short," so I shouldn't go into too much detail. Let me structure it: definition, training, capabilities, applications. That should cover it. Make sure it's clear and to the point.
# </think>
# content: Large language models (LLMs) are advanced AI systems trained on vast amounts of text data to understand and generate human-like language. They can answer questions, write stories, code, translate languages, and perform various tasks by analyzing patterns in the data. These models, like GPT or BERT, leverage massive datasets and complex algorithms to produce coherent, context-aware responses, making them powerful tools for communication, creativity, and problem-solving across multiple domains.


prompt = "9.11和9.8哪个数字大"  # "Which number is larger, 9.11 or 9.8?"
##INT4:
# thinking content: 
# content: <think>
# 好的，我现在需要比较9.11和9.8哪个数字更大。首先，我应该回忆一下小数比较的方法。通常，比较小数的时候，我们会从左到右逐位比较，先看整数部分，如果整数部分相同，再比较小数部分。
# 首先，这两个数的整数部分都是9，所以整数部分相同。接下来比较小数部分。9.11的小数部分是0.11，而9.8的小数部分是0.8。这时候我需要比较0.11和0.8的大小。
# 为了更清楚地比较，我可以把它们转换成同一位数的小数。比如，0.11可以看作0.110，而0.8可以看作0.800。这样，比较每一位：
# 第一位小数：0.110的十分位是1，而0.800的十分位是8。显然，8比1大，所以0.800比0.110大。因此，9.8比9.11大。
# 不过，我是不是应该再检查一下？比如，有没有可能在比较小数时出现其他情况？比如，如果小数位数不同，是否需要补零？比如，9.11和9.8，后者的小数位数更少，所以补零到相同位数的话，确实是0.110和0.800，这样比较更直观。
# 或者，我可以将它们转换为分数来比较。9.11等于9又11/100，而9.8等于9又80/100。显然，80/100比11/100大，所以9.8更大。
# 另外，也可以用数值的大小来直观比较。比如，9.8等于9.80，而9.11是9.11，显然9.80比9.11大，因为小数点后第一位8比1大。
# 不过，有没有可能我哪里弄错了？比如，是否在比较小数时应该先比较整数部分，然后小数部分？是的，没错。整数部分相同的情况下，比较小数部分，所以正确。
# 或者，有没有可能把9.8看成9.80，而9.11是9.11，所以

##BF16:
# thinking content: 
# content: <think>
# 嗯，用户问的是9.11和9.8哪个数字大。首先，我需要确认这两个数字的结构。9.11是一个小数，而9.8也是一个小数。看起来都是以9开头，但后面的小数部分不同。
# 首先，我应该比较整数部分。两个数的整数部分都是9，所以这时候需要比较小数部分。小数部分的话，9.11的小数部分是0.11，而9.8的小数部分是0.8。这时候可能需要将它们转换成相同的位数来比较，或者直接比较小数点后的数字。
# 不过，可能用户不太清楚小数点后的位数如何比较。比如，9.11有两位小数，而9.8只有一位。这时候可能需要补零，让它们的小数位数相同。比如，9.8可以写成9.80，这样比较的话，0.80和0.11。这时候，0.80比0.11大，所以9.8比9.11大。
# 不过，也有可能用户会误以为小数点后的位数多的数字更大，比如认为9.11比9.8大，因为有两位小数。但其实小数点后的位数并不影响大小，而是要看每一位的数值。比如，第一位小数是十分位，第二位是百分位。所以，比较的时候，先看十分位，9.8的十分位是8，而9.11的十分位是1，显然8比1大，所以9.8更大。
# 不过，也有可能用户会混淆小数点后的位数，比如认为9.11的百分位是1，而9.8的百分位是0，所以可能误以为9.11更大。这时候需要明确说明比较的步骤，先比较整数部分，再比较小数部分，逐位比较。
# 另外，可能用户是在学习小数比较，所以需要详细解释步骤，确保他们理解如何正确比较小数。比如，先看整数部分，如果整数部分相同，再比较小数部分，从左到右逐位比较，直到找到不同的数字为止。
# 总结一下，正确的比较方法是：整数部分相同，比较小数部分，十分位8比1大，所以9.8更大。因此，答案应该是9.


prompt = "How many r in word strawberry"
##INT4:
# thinking content: 
# content: <think>
# Okay, let's see. The user is asking how many times the letter 'r' appears in the word "strawberry". Hmm, first I need to make sure I spell the word correctly. Let me write it out: S-T-R-A-W-B-E-R-R-Y. Wait, is that right? Let me check again. S-T-R-A-W-B-E-R-R-Y. Yes, that's strawberry. Now, I need to count the number of 'r's in there.
# Starting from the beginning: S is the first letter, then T, then R. So that's one 'r'. Then the next letters are A, W, B, E, R, R, Y. Wait, after the first R, there's another R later. Let me break it down letter by letter.
# S - no
# T - no
# R - yes, that's the first one
# A - no
# W - no
# B - no
# E - no
# R - second one
# R - third one
# Y - no
# So that's three 'r's? Wait, let me count again. The word is S-T-R-A-W-B-E-R-R-Y. So positions:
# 1. S
# 2. T
# 3. R
# 4. A
# 5. W
# 6. B
# 7. E
# 8. R
# 9. R
# 10. Y
# So the 'r's are at positions 3, 8, and 9. That's three 'r's. Wait, but sometimes people might miscount. Let me check once more. The word is strawberry. Breaking it down:
# Strawberry: S-T-R-A-W-B-E-R-R-Y. Yes, so the third letter is R, then after the E, there are two R's. So that's three R's. So the answer should be 3. But wait, maybe I'm missing something. Let me write it out again:
# S T R A W B E R R Y
# So letters:
# 1. S
# 2. T
# 3. R
# 4. A
# 5. W
# 6. B
# 7. E
# 8. R
# 9. R
# 10. Y
# Yes, three R's. So the answer is 3. I think that's correct. I don't see any other R's. Maybe the user is testing if I can count correctly, so I need to make sure. Let me check an online source or

##BF16:
# thinking content: 
# content: <think>
# Okay, let's see. The user is asking how many times the letter 'r' appears in the word "strawberry". Hmm, first I need to make sure I have the correct spelling of the word. Let me write it out: S-T-R-A-W-B-E-R-R-Y. Wait, is that right? Let me check again. Strawberry... yes, that's correct. Now, I need to count the number of 'r's in that spelling.

# Let me go through each letter one by one. Starting with the first letter: S. Not an 'r'. Next is T. Still not. Then R. Okay, that's one. Then A, W, B, E. So far, only one 'r'. Then comes the next letters: R. That's the second 'r'. Then another R. Wait, is there a third 'r'? Let me check again. The word is S-T-R-A-W-B-E-R-R-Y. So after the first R, there's a B, E, then two R's. So that's two R's? Wait, no. Let me count again. Let's break it down:

# 1. S
# 2. T
# 3. R (1st)
# 4. A
# 5. W
# 6. B
# 7. E
# 8. R (2nd)
# 9. R (3rd)
# 10. Y

# Wait, so the letters are S, T, R, A, W, B, E, R, R, Y. So the 'r's are at positions 3, 8, and 9. That's three 'r's. But wait, maybe I miscounted. Let me write it out again:

# S-T-R-A-W-B-E-R-R-Y. So after the first R (position 3), then the next letters are A, W, B, E, then R (position 8), then another R (position 9). So that's three R's. But sometimes people might miss the second R. Let me check again. The word is strawberry. Let me spell it again: S-T-R-A-W-B-E-R-R-Y. Yes, that's correct. So the R's are in the third, eighth, and ninth positions. Therefore, there are three 'r's. Wait, but I thought maybe the answer was two. Maybe I need to confirm. Let me check an online source or think of another way. Alternatively, maybe


prompt = "请简短介绍一下阿里巴巴公司"  # "Please give a brief introduction to Alibaba."
##INT4:
# thinking content: <think>
# 好的，用户让我简短介绍一下阿里巴巴公司。首先，我需要确定用户的需求是什么。他们可能是在做研究，或者想了解阿里巴巴的基本信息，或者只是好奇。不管怎样，简短是关键，所以我要抓住核心点。
# 阿里巴巴成立于1999年，由马云创立，这点必须提到。然后是它的主要业务，比如淘宝、天猫、支付宝这些大家比较熟悉的平台。可能还要提到它的国际业务，比如阿里国际站和Lazada，显示其全球影响力。
# 另外，阿里巴巴集团的结构，比如旗下的不同子公司，比如淘宝、天猫、菜鸟网络、阿里云等，这些可以简要带过。还有它的愿景，比如成为全球最大的电子商务公司，以及在科技和创新方面的投入，比如阿里云和大数据技术。
# 用户可能还想知道阿里巴巴的影响力，比如用户数量、市场份额，或者它在科技领域的成就。不过因为要简短，可能不需要太多数据，但可以提到它是中国最大的互联网公司之一，全球知名的电商平台。
# 还要注意避免太技术性的术语，保持简洁易懂。可能需要检查是否有遗漏的重要信息，比如阿里巴巴的上市时间（2014年），或者其在社交媒体和物流方面的业务，比如菜鸟网络。不过这些可能属于次要信息，可以简略带过。
# 最后，确保整体结构清晰，先介绍成立时间、创始人，然后主要业务，再提到国际业务和科技贡献，最后总结其地位和愿景。这样用户就能快速了解阿里巴巴的基本情况。
# </think>
# content: 阿里巴巴集团（Alibaba Group）成立于1999年，由马云创立，是中国乃至全球最具影响力的互联网企业之一。其核心业务涵盖电子商务（如淘宝、天猫）、数字支付（支付宝）、云计算（阿里云）、物流（菜鸟网络）及全球化零售（Lazada、阿里国际站）等领域。阿里巴巴致力于通过技术创新推动商业变革，现已成为全球最大的电商平台之一，并在人工智能、大数据等前沿科技领域持续布局。2014年，其在美国纽约证券交易所上市（股票代码：BABA）。

##BF16:
# thinking content: <think>
# 嗯，用户让我简短介绍一下阿里巴巴公司。首先，我需要确定用户的需求是什么。可能他们只是想了解阿里巴巴的基本信息，或者是在做作业、写报告，或者只是好奇。不管怎样，用户需要的是简洁明了的回答，不能太冗长。
# 接下来，我得回忆一下阿里巴巴的基本信息。阿里巴巴集团成立于1999年，总部在杭州，对吧？创始人是马云，不过现在可能已经卸任了。阿里巴巴主要业务包括电商平台，比如淘宝、天猫，还有菜鸟网络、支付宝、阿里云这些子公司。可能还需要提到它的全球影响力，比如在电子商务、云计算、金融科技方面的成就。
# 用户可能还想知道阿里巴巴的规模，比如员工数量、市值，或者它在行业中的地位。不过用户要求简短，所以这些细节可能需要简化。另外，可能需要提到它的使命，比如让天下没有难做的生意，或者推动数字化转型。
# 还要注意避免错误信息，比如马云是否还在任。我记得他已经在2019年卸任董事局主席，现在是董事会成员。另外，阿里巴巴的上市时间是2014年，港交所和纽交所都有。可能需要提到这些关键点。
# 用户可能没有说出来的深层需求是希望了解阿里巴巴的核心业务和影响力，而不仅仅是表面信息。所以需要突出其作为全球最大的电商平台之一，以及在云计算和金融科技领域的领先地位。同时，可能还要提到其社会责任，比如环保、公益项目等。
# 需要确保回答结构清晰，分点列出，但用户要求简短，所以可能用一两句话概括。比如成立时间、总部、创始人、主要业务、全球地位等。还要注意用词准确，比如“集团”而不是“公司”，因为阿里巴巴有多个子公司。
# 最后检查有没有遗漏的重要信息，比如是否提到支付宝和阿里云，这些是阿里巴巴的重要组成部分。同时，可能需要提到其对中国经济和全球电商的影响，但保持简洁。确保没有错误，比如成立年份、上市地点等。这样用户就能得到一个全面但简短的介绍。
# </think>
# content: 阿里巴巴集团（Alibaba Group）成立于1999年，总部位于中国杭州，由马云创立。它是全球领先的互联网科技公司，核心业务涵盖电子商务（淘宝、天猫）、云计算（阿里云）、金融科技（支付宝）、物流（菜鸟网络）及创新业务（如盒马鲜生、阿里健康等）。阿里巴巴致力于通过数字化技术赋能企业与消费者，推动全球商业变革，旗下拥有
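
The parsing step in the snippet above splits the generated ids at the last </think> token (id 151668) and falls back to index 0 when that token is absent. The sketch below isolates that logic on plain lists; the helper name and the small ids are made up for illustration. It also suggests why some of the sample outputs above show an empty thinking content: with max_new_tokens=512 the generation can stop before </think> is produced, so everything lands in content.

```python
END_THINK_ID = 151668  # token id of </think>, as used in the snippet above

def split_at_last_end_think(output_ids, end_think_id=END_THINK_ID):
    """Return (thinking_ids, content_ids), split after the last </think>."""
    try:
        # Search the reversed list to find the last occurrence.
        index = len(output_ids) - output_ids[::-1].index(end_think_id)
    except ValueError:
        # No </think> at all: nothing is treated as thinking.
        index = 0
    return output_ids[:index], output_ids[index:]

# Made-up ids: 1 and 2 stand for reasoning tokens, 3 and 4 for the answer.
complete = [1, 2, END_THINK_ID, 3, 4]
truncated = [1, 2, 3, 4]

print(split_at_last_end_think(complete))   # ([1, 2, 151668], [3, 4])
print(split_at_last_end_think(truncated))  # ([], [1, 2, 3, 4])
```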

Evaluate the model

pip3 install lm-eval

auto-round-eval --model "Intel/Qwen3-8B-int4-AutoRound-inc" --eval_bs 16  --tasks leaderboard_ifeval,leaderboard_mmlu_pro,gsm8k,lambada_openai,hellaswag,piqa,winogrande,truthfulqa_mc1,openbookqa,boolq,arc_easy,arc_challenge,mmlu,cmmlu,ceval-valid

Metric	BF16	INT4(best)	INT4(default)
Avg	0.6184	0.6123	0.6063
arc_easy	0.8342	0.8295	0.8224
arc_challenge	0.5418	0.5496	0.5418
boolq	0.8673	0.8673	0.8654
ceval-valid	0.7912	0.7786	0.7741
cmmlu	0.7702	0.7588	0.7527
gsm8k 5 shots	0.8810	0.8643	0.8688
hellaswag	0.5708	0.5626	0.5615
lambada_openai	0.6400	0.6387	0.6305
leaderboard_mmlu_pro 5 shots	0.4759	0.4687	0.4676
leaderboard_ifeval inst_level_strict_acc	0.3957	0.3957	0.3789
leaderboard_ifeval prompt_level_strict_acc	0.2532	0.2477	0.2200
mmlu	0.7294	0.7209	0.7168
openbookqa	0.3140	0.3120	0.3020
piqa	0.7666	0.7628	0.7633
truthfulqa_mc1	0.3672	0.3574	0.3550
winogrande	0.6811	0.6827	0.6803
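
The Avg row appears to be the unweighted mean of the 16 metric rows. The short sketch below checks that for the INT4(best) column and expresses the reported averages relative to BF16. All numbers are copied from the table; the variable names are only illustrative.

```python
# INT4(best) column, copied from the table above.
int4_best = {
    "arc_easy": 0.8295,
    "arc_challenge": 0.5496,
    "boolq": 0.8673,
    "ceval-valid": 0.7786,
    "cmmlu": 0.7588,
    "gsm8k": 0.8643,
    "hellaswag": 0.5626,
    "lambada_openai": 0.6387,
    "leaderboard_mmlu_pro": 0.4687,
    "leaderboard_ifeval_inst_level_strict_acc": 0.3957,
    "leaderboard_ifeval_prompt_level_strict_acc": 0.2477,
    "mmlu": 0.7209,
    "openbookqa": 0.3120,
    "piqa": 0.7628,
    "truthfulqa_mc1": 0.3574,
    "winogrande": 0.6827,
}

# Avg row as reported in the table.
reported_avg = {"BF16": 0.6184, "INT4(best)": 0.6123, "INT4(default)": 0.6063}

mean_best = sum(int4_best.values()) / len(int4_best)
print(f"recomputed INT4(best) mean: {mean_best:.4f}")

for name in ("INT4(best)", "INT4(default)"):
    retained = reported_avg[name] / reported_avg["BF16"]
    print(f"{name}: {retained:.1%} of the BF16 average")
```

By this measure, INT4(best) retains about 99.0% of the BF16 average and INT4(default) about 98.0%.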

Generate the model

Here is a sample command to generate the model.

auto-round-best \
--model Qwen/Qwen3-8B \
--device 0 \
--group_size 128 \
--bits 4 \
--format 'auto_round' \
--output_dir "./tmp_autoround"
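
For readers who prefer Python over the CLI, a rough equivalent might look like the sketch below. This is an assumption-laden sketch, not the exact recipe used for this model: the `nsamples` and `iters` values are a guess at what the "best" preset uses, and the `AutoRound` constructor arguments and `save_quantized` call should be checked against the auto-round version you have installed. The function name `quantize_qwen3` is hypothetical.

```python
# Assumed settings mirroring the command above. nsamples and iters are an
# assumption about the "best" preset; verify them in the auto-round docs.
BEST_RECIPE = {
    "bits": 4,
    "group_size": 128,
    "sym": True,
    "nsamples": 512,
    "iters": 1000,
}

def quantize_qwen3(model_name="Qwen/Qwen3-8B",
                   output_dir="./tmp_autoround",
                   recipe=BEST_RECIPE):
    """Quantize the model and save it in the auto_round format."""
    # Imported lazily so the heavy dependencies are only needed when called.
    from auto_round import AutoRound
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")

    autoround = AutoRound(model, tokenizer, **recipe)
    autoround.quantize()
    autoround.save_quantized(output_dir, format="auto_round")

# quantize_qwen3()  # downloads the base model and can take a long time
```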

Ethical Considerations and Limitations

The model can produce factually incorrect output, and should not be relied on to produce factually accurate information. Because of the limitations of the pretrained model and the finetuning datasets, it is possible that this model could generate lewd, biased or otherwise offensive outputs.

Therefore, before deploying any applications of the model, developers should perform safety testing.

Caveats and Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.

To learn more about Intel's AI software, see:

Intel Neural Compressor

Disclaimer

The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model. Please consult an attorney before using this model for commercial purposes.

Cite

@article{cheng2023optimize,
  title={Optimize weight rounding via signed gradient descent for the quantization of llms},
  author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi},
  journal={arXiv preprint arXiv:2309.05516},
  year={2023}
}

